Attrition Case Study for Frito Lay
A case study of the workforce data set for MSDS 6306 by Renu Karthikeyan
CaseStudy2 Attrition
Renu Karthikeyan
November/December 2023
Statement of Purpose:
I am pleased to present this case study on attrition at Frito Lay, which analyzes the company's workforce data to derive actionable insights. The purpose of this study is to understand the factors influencing employee attrition, develop predictive models for attrition risk, and evaluate the performance of those models.
Objectives:
- Data Exploration and Visualization: Conduct a thorough exploration of the available workforce data to gain insights into the distribution of various attributes. Visualize key trends and patterns related to attrition, employee demographics, and other relevant factors.
- Predictive Modeling: Use K-nearest neighbors (KNN) and Naive Bayes to build predictive models for employee attrition. Evaluate the performance of each model and identify the most effective approach for predicting attrition risk. Create a linear regression model to predict employee salary, given all other predictors.
- Shiny App Development: Create an interactive Shiny app to visualize and communicate the insights derived from the analysis. Provide a user-friendly platform for stakeholders to explore the data and understand the implications of the findings.
- Communication and Collaboration: Effectively communicate the results and recommendations to stakeholders through clear and concise reports and presentations.
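The two classifiers named in the objectives can be sketched in a few lines. This is a minimal illustration, not the case study's final models: it uses the built-in iris data so it runs on its own, while the actual attrition models use the workforce data.

```r
library(class)   # knn()
library(e1071)   # naiveBayes()

set.seed(6306)
idx   <- sample(nrow(iris), 0.8 * nrow(iris))  # 80/20 train-test split
train <- iris[idx, ]
test  <- iris[-idx, ]

# KNN classifies each test row by majority vote of its k nearest training rows
knn_pred <- knn(train[, 1:4], test[, 1:4], cl = train$Species, k = 5)

# Naive Bayes fits per-class feature distributions and applies Bayes' rule
nb_fit  <- naiveBayes(Species ~ ., data = train)
nb_pred <- predict(nb_fit, test)

# Holdout accuracy for each model
mean(knn_pred == test$Species)
mean(nb_pred == test$Species)
```

The same pattern (split, fit, predict, score) carries over to the attrition data, with Attrition as the class label.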
Load and Install Libraries
library(dplyr)
library(ggplot2)
library(caret)
library(aws.s3)
library(RCurl)
library(readr)
library(base)
library(tidyverse)
library(naniar)
library(class)
library(GGally)
library(e1071)
library(car)
library(fastDummies)
Set AWS credentials, and Load in Datasets from Amazon S3 Bucket
Here, the data is loaded from the AWS S3 bucket. I did some slight cleanup of the data: I excluded the "Over18" column from the original data set and the Attrition test data set, and the ID column name in the No Salary data set did not match the others, so I adjusted it. I also checked for missing values; there are none in any of the data sets, so no imputation or row deletion is needed.
# Load the Attrition data set from S3
#s3_path <- "s3://msds.ds.6306.2/CaseStudy2-data.csv"
# Read the Attrition Data CSV file from S3
Attritiondata <- read.table(textConnection(getURL("https://msdsds6306.s3.us-east-2.amazonaws.com/CaseStudy2-data.csv")), sep =",", header =T)
head(Attritiondata,5)
## ID Age Attrition BusinessTravel DailyRate Department
## 1 1 32 No Travel_Rarely 117 Sales
## 2 2 40 No Travel_Rarely 1308 Research & Development
## 3 3 35 No Travel_Frequently 200 Research & Development
## 4 4 32 No Travel_Rarely 801 Sales
## 5 5 24 No Travel_Frequently 567 Research & Development
## DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1 13 4 Life Sciences 1 859
## 2 14 3 Medical 1 1128
## 3 18 2 Life Sciences 1 1412
## 4 1 4 Marketing 1 2016
## 5 2 1 Technical Degree 1 1646
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Male 73 3 2
## 2 3 Male 44 2 5
## 3 3 Male 60 3 3
## 4 3 Female 48 3 3
## 5 1 Female 32 3 1
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1 Sales Executive 4 Divorced 4403
## 2 Research Director 3 Single 19626
## 3 Manufacturing Director 4 Single 9362
## 4 Sales Executive 4 Married 10422
## 5 Research Scientist 4 Single 3760
## MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
## 1 9250 2 Y No 11
## 2 17544 1 Y No 14
## 3 19944 2 Y No 11
## 4 24032 1 Y No 19
## 5 17218 1 Y Yes 13
## PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
## 1 3 3 80 1
## 2 3 1 80 0
## 3 3 3 80 0
## 4 3 3 80 2
## 5 3 3 80 0
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 1 8 3 2 5
## 2 21 2 4 20
## 3 10 2 3 2
## 4 14 3 3 14
## 5 6 2 3 6
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 1 2 0 3
## 2 7 4 9
## 3 2 2 2
## 4 10 5 7
## 5 3 1 3
#summary(Attritiondata)
vis_miss(Attritiondata) #checking for no missing values
#Read CSV "testing" files from S3
# Reading in of NoSalary Dataset from S3 Bucket
NoSalary <- read.table(textConnection(getURL("https://msdsds6306.s3.us-east-2.amazonaws.com/CaseStudy2CompSet+No+Salary.csv")), sep = ",", header = TRUE)
head(NoSalary,5)
## ï..ID Age Attrition BusinessTravel DailyRate Department
## 1 871 43 No Travel_Frequently 1422 Sales
## 2 872 33 No Travel_Rarely 461 Research & Development
## 3 873 55 Yes Travel_Rarely 267 Sales
## 4 874 36 No Non-Travel 1351 Research & Development
## 5 875 27 No Travel_Rarely 1302 Research & Development
## DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1 2 4 Life Sciences 1 1849
## 2 13 1 Life Sciences 1 995
## 3 13 4 Marketing 1 1372
## 4 9 4 Life Sciences 1 1949
## 5 19 3 Other 1 1619
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 1 Male 92 3 2
## 2 2 Female 53 3 1
## 3 1 Male 85 4 4
## 4 1 Male 66 4 1
## 5 4 Male 67 2 1
## JobRole JobSatisfaction MaritalStatus MonthlyRate
## 1 Sales Executive 4 Married 19246
## 2 Research Scientist 4 Single 17241
## 3 Sales Executive 3 Single 9277
## 4 Laboratory Technician 2 Married 9238
## 5 Laboratory Technician 1 Divorced 16290
## NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating
## 1 1 Y No 20 4
## 2 3 Y No 18 3
## 3 6 Y Yes 17 3
## 4 1 Y No 22 4
## 5 1 Y No 11 3
## RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 1 3 80 1 7
## 2 1 80 0 5
## 3 3 80 0 24
## 4 2 80 0 5
## 5 1 80 2 7
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1 5 3 7 7
## 2 4 3 3 2
## 3 2 2 19 7
## 4 3 3 5 4
## 5 3 3 7 7
## YearsSinceLastPromotion YearsWithCurrManager
## 1 7 7
## 2 0 2
## 3 3 8
## 4 0 2
## 5 0 7
#summary(NoSalary)
vis_miss(NoSalary)
AttritionTest <- read.table(textConnection(getURL("https://msdsds6306.s3.us-east-2.amazonaws.com/CaseStudy2CompSet+No+Attrition.csv")), sep = ",", header = TRUE)
head(AttritionTest,5)
## ID Age BusinessTravel DailyRate Department DistanceFromHome
## 1 1171 35 Travel_Rarely 750 Research & Development 28
## 2 1172 33 Travel_Rarely 147 Human Resources 2
## 3 1173 26 Travel_Rarely 1330 Research & Development 21
## 4 1174 55 Travel_Rarely 1311 Research & Development 2
## 5 1175 29 Travel_Rarely 1246 Sales 19
## Education EducationField EmployeeCount EmployeeNumber
## 1 3 Life Sciences 1 1596
## 2 3 Human Resources 1 1207
## 3 3 Medical 1 1107
## 4 3 Life Sciences 1 505
## 5 3 Life Sciences 1 1497
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Male 46 4 2
## 2 2 Male 99 3 1
## 3 1 Male 37 3 1
## 4 3 Female 97 3 4
## 5 3 Male 77 2 2
## JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## 1 Laboratory Technician 3 Married 3407 25348
## 2 Human Resources 3 Married 3600 8429
## 3 Laboratory Technician 3 Divorced 2377 19373
## 4 Manager 4 Single 16659 23258
## 5 Sales Executive 3 Divorced 8620 23757
## NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating
## 1 1 Y No 17 3
## 2 1 Y No 13 3
## 3 1 Y No 20 4
## 4 2 Y Yes 13 3
## 5 1 Y No 14 3
## RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 1 4 80 2 10
## 2 4 80 1 5
## 3 3 80 1 1
## 4 3 80 0 30
## 5 3 80 2 10
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1 3 2 10 9
## 2 2 3 5 4
## 3 0 2 1 1
## 4 2 3 5 4
## 5 3 3 10 7
## YearsSinceLastPromotion YearsWithCurrManager
## 1 6 8
## 2 1 4
## 3 0 0
## 4 1 2
## 5 0 4
#summary(AttritionTest)
vis_miss(AttritionTest)
Attritiondata <- subset(Attritiondata, select = -c(Over18))
head(Attritiondata,5)
## ID Age Attrition BusinessTravel DailyRate Department
## 1 1 32 No Travel_Rarely 117 Sales
## 2 2 40 No Travel_Rarely 1308 Research & Development
## 3 3 35 No Travel_Frequently 200 Research & Development
## 4 4 32 No Travel_Rarely 801 Sales
## 5 5 24 No Travel_Frequently 567 Research & Development
## DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1 13 4 Life Sciences 1 859
## 2 14 3 Medical 1 1128
## 3 18 2 Life Sciences 1 1412
## 4 1 4 Marketing 1 2016
## 5 2 1 Technical Degree 1 1646
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Male 73 3 2
## 2 3 Male 44 2 5
## 3 3 Male 60 3 3
## 4 3 Female 48 3 3
## 5 1 Female 32 3 1
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1 Sales Executive 4 Divorced 4403
## 2 Research Director 3 Single 19626
## 3 Manufacturing Director 4 Single 9362
## 4 Sales Executive 4 Married 10422
## 5 Research Scientist 4 Single 3760
## MonthlyRate NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating
## 1 9250 2 No 11 3
## 2 17544 1 No 14 3
## 3 19944 2 No 11 3
## 4 24032 1 No 19 3
## 5 17218 1 Yes 13 3
## RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 1 3 80 1 8
## 2 1 80 0 21
## 3 3 80 0 10
## 4 3 80 2 14
## 5 3 80 0 6
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1 3 2 5 2
## 2 2 4 20 7
## 3 2 3 2 2
## 4 3 3 14 10
## 5 2 3 6 3
## YearsSinceLastPromotion YearsWithCurrManager
## 1 0 3
## 2 4 9
## 3 2 2
## 4 5 7
## 5 1 3
colnames(NoSalary)[colnames(NoSalary)=="ï..ID"] <- "ID"
colnames(NoSalary)
## [1] "ID" "Age"
## [3] "Attrition" "BusinessTravel"
## [5] "DailyRate" "Department"
## [7] "DistanceFromHome" "Education"
## [9] "EducationField" "EmployeeCount"
## [11] "EmployeeNumber" "EnvironmentSatisfaction"
## [13] "Gender" "HourlyRate"
## [15] "JobInvolvement" "JobLevel"
## [17] "JobRole" "JobSatisfaction"
## [19] "MaritalStatus" "MonthlyRate"
## [21] "NumCompaniesWorked" "Over18"
## [23] "OverTime" "PercentSalaryHike"
## [25] "PerformanceRating" "RelationshipSatisfaction"
## [27] "StandardHours" "StockOptionLevel"
## [29] "TotalWorkingYears" "TrainingTimesLastYear"
## [31] "WorkLifeBalance" "YearsAtCompany"
## [33] "YearsInCurrentRole" "YearsSinceLastPromotion"
## [35] "YearsWithCurrManager"
AttritionTest <- subset(AttritionTest, select = -c(Over18))
head(AttritionTest,5)
## ID Age BusinessTravel DailyRate Department DistanceFromHome
## 1 1171 35 Travel_Rarely 750 Research & Development 28
## 2 1172 33 Travel_Rarely 147 Human Resources 2
## 3 1173 26 Travel_Rarely 1330 Research & Development 21
## 4 1174 55 Travel_Rarely 1311 Research & Development 2
## 5 1175 29 Travel_Rarely 1246 Sales 19
## Education EducationField EmployeeCount EmployeeNumber
## 1 3 Life Sciences 1 1596
## 2 3 Human Resources 1 1207
## 3 3 Medical 1 1107
## 4 3 Life Sciences 1 505
## 5 3 Life Sciences 1 1497
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Male 46 4 2
## 2 2 Male 99 3 1
## 3 1 Male 37 3 1
## 4 3 Female 97 3 4
## 5 3 Male 77 2 2
## JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## 1 Laboratory Technician 3 Married 3407 25348
## 2 Human Resources 3 Married 3600 8429
## 3 Laboratory Technician 3 Divorced 2377 19373
## 4 Manager 4 Single 16659 23258
## 5 Sales Executive 3 Divorced 8620 23757
## NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating
## 1 1 No 17 3
## 2 1 No 13 3
## 3 1 No 20 4
## 4 2 Yes 13 3
## 5 1 No 14 3
## RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 1 4 80 2 10
## 2 4 80 1 5
## 3 3 80 1 1
## 4 3 80 0 30
## 5 3 80 2 10
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1 3 2 10 9
## 2 2 3 5 4
## 3 0 2 1 1
## 4 2 3 5 4
## 5 3 3 10 7
## YearsSinceLastPromotion YearsWithCurrManager
## 1 6 8
## 2 1 4
## 3 0 0
## 4 1 2
## 5 0 4
EDA
Looking at relationships within the overall data, not just those who attrited.
ggplot(data=Attritiondata, aes(x=JobSatisfaction)) + geom_bar(position="dodge") + theme_minimal() + ggtitle("Overall Job Satisfaction")
Here, we look at overall Job Satisfaction among the employees. The majority of employees seem to be satisfied with their job, with a handful not being as satisfied. The distribution is left skewed.
ggplot(data=Attritiondata,aes(x=MonthlyIncome, y=Age)) + geom_point(position="jitter") + facet_wrap(~MaritalStatus) + geom_smooth(method="loess") + ggtitle("Monthly Income and Age categorized by Marital Status")
Here, we see Monthly Income by Age categorized by Marital Status. Divorced and Married employees tend to make more as their age increases. The same positive trend holds for Single employees, but they do not make as much as their Married or Divorced coworkers. Overall, there is a clear positive trend between age and monthly income.
ggplot(data = Attritiondata, aes(x = MonthlyIncome, y = Age, color = JobInvolvement)) +
geom_point(position = "jitter") +
geom_smooth(method = lm) +
  ggtitle("Job Involvement and Monthly Income") + facet_wrap(~JobInvolvement)
Looking at Job Involvement and Monthly Income, it looks like most employees, regardless of Job Involvement, show a positive correlation between Age and Monthly Income. I was trying to see whether those who were more involved in their job earned a higher income, but that does not seem to be the case at first glance. It looks like many employees are fairly involved in their jobs (Job Involvement levels 2 and 3).
ggplot(data = Attritiondata, aes(x = MonthlyIncome, y = Age, color = interaction(JobInvolvement))) +
geom_point(position = "jitter") +
geom_smooth(method = lm, se = FALSE) +
  ggtitle("Job Involvement and Monthly Income")
When I plotted all four Job Involvement levels with their respective linear regression lines, those with the least job involvement (Job Involvement 1) start to make more after crossing 40 years old and a monthly income of 10,000. Job Involvement 3 makes the most starting off, up to a monthly income of 10,000 and ~40 years old, then makes the least as they near 50 years old. The regression lines for Job Involvement 2 and 4 fall in between initially and, toward the end, are very close to each other and seem to overlap.
ggplot(data=Attritiondata,aes(x=MonthlyIncome,y=NumCompaniesWorked)) + geom_point(position="jitter") + geom_smooth(method=lm) + ggtitle("Number of Companies Worked and Monthly Income")
There is a positive correlation between Monthly Income and Number of Companies Worked.
ggplot(data = Attritiondata, aes(x = MonthlyIncome)) + geom_histogram() + ggtitle("Monthly Income Histogram")
Looking at a histogram of Monthly Income, it appears right skewed: the mode is less than the median, which is less than the mean.
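The skew claim can be checked numerically: for a right-skewed distribution the mean exceeds the median. A quick sketch on synthetic income-like data (the actual check would run on Attritiondata$MonthlyIncome after loading the data):

```r
set.seed(6306)
x <- rexp(1000, rate = 1 / 6000)  # synthetic right-skewed "incomes", mean ~6000
c(mean = mean(x), median = median(x))
mean(x) > median(x)  # TRUE: the long right tail pulls the mean above the median
```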
ggplot(data=Attritiondata,aes(x=MonthlyIncome)) + geom_histogram() + ggtitle("Monthly Income Histogram Categorized by Gender") + facet_wrap(~Gender)
Looking at Monthly Income categorized by Gender, there are more male data points in the data set than female. The mode is higher for men than for women. Both histograms are right skewed.
ggplot(data=Attritiondata,aes(x=MonthlyIncome, y=DistanceFromHome)) + geom_point(position="jitter") + geom_smooth(method=loess) + ggtitle("Monthly Income and Distance from Home")
There seems to be a negative correlation between distance from home and monthly income. The plot and loess curve imply that distance from home initially increases with monthly income, but after monthly income exceeds 15,000, distance from home decreases overall. The curve has a slight downturned 'w' shape. From this graph, I would interpret distance from home to be in miles; in kilometers it would not make as much logical sense.
ggplot(data=Attritiondata,aes(x=MonthlyIncome, y=DistanceFromHome)) + geom_point(position="jitter") + geom_smooth(method=loess) + facet_wrap(~Gender) + ggtitle("Monthly Income and Distance from Home categorized by Gender")
The downturned 'w' shape is more prominent for women than for men, but the overall relationship between distance from home and Monthly Income is as described above.
ggplot(data=Attritiondata, aes(x=Department, y=JobSatisfaction, color = Gender)) + geom_point(position ="jitter") + facet_wrap(~Gender) + ggtitle("Department and Job Satisfaction by Gender")
Job Satisfaction seems to be higher in Research & Development for both males and females. There are fewer data points for Human Resources, so a clearly defined relationship can't be concluded. Job satisfaction also seems decent in Sales for both genders.
ggplot(data=Attritiondata,aes(x=Age, y=TotalWorkingYears, color = MonthlyIncome)) + geom_point(position = "jitter") + ggtitle("Age and Total Working Years with Monthly Income")
This plot confirms that as Age increases, Total Working Years also increases, and Monthly Income follows the same trend. There is a positive correlation among all three variables: Age, Total Working Years, and Monthly Income.
Attrition Specific Analysis
I filtered the data to look specifically at those who attrited, to find insights and relationships between the variables.
attrition_yes <- dplyr::filter(Attritiondata, Attrition == "Yes")
ggplot(data = attrition_yes, aes(x = Department, fill = Gender)) + geom_bar(position = "dodge") +
ggtitle("Attrition by Department and Gender") + theme_minimal()
Of those who left the company, many men were in Research & Development and Sales, with roughly equal numbers of women in Research & Development and Sales. Roughly equal numbers of men and women from Human Resources left the company.
ggplot(data = attrition_yes, aes(y = JobSatisfaction , x = DistanceFromHome, color = MonthlyIncome))+
  geom_point(position = "jitter") + theme_minimal() + geom_smooth(method =lm) + ggtitle("Attrition by Distance from Home, Job Satisfaction with Monthly Income")
Of those who left the company, it seems like many were making under $10,000 monthly. There seems to be a negative relationship between Job Satisfaction and Distance from Home: as distance from home (in miles) increases, job satisfaction goes down.
ggplot(data = attrition_yes, aes(y=JobLevel, x = MonthlyIncome, color = Age))+
  geom_point(position="jitter")+geom_smooth(method=lm) + ggtitle("Monthly Income and Job Level with Age")
Of those who left the company, there is a positive relationship between Job Level and Monthly Income. Age seems scattered, but around Job Level 1 with monthly income below 5,000, the group skews toward employees in their 20s.
ggplot(data = attrition_yes, aes(y=JobLevel, x = Age, color = Gender))+
  geom_point(position="jitter")+geom_smooth(method=lm) + ggtitle("Attrition Job Level by Age and Gender")
Of those who left the company, Job Level is positively correlated with Age for both males and females. The slope of the regression line for females seems steeper than for males.
ggplot(data = attrition_yes, aes(x=OverTime, fill = Gender)) + geom_bar() + ggtitle("Attrition - Overtime by Gender")
Of those who left the company, many employees were working overtime. As mentioned earlier, there are more male data points than female, which is why the chart may seem a bit off at first glance. Of the overtime group, approximately 70% were male and 30% were female.
ggplot(data = attrition_yes, aes(x=NumCompaniesWorked, y = PercentSalaryHike)) + geom_point(position = "jitter") + theme_minimal()+geom_smooth(method = lm) + ggtitle("Attrition - Percent Salary Hike and Number of Companies Worked")
There seems to be a slight downward trend in the relationship between Number of Companies Worked and Percent Salary Hike. I was trying to see whether the number of companies worked affected the percent salary hike positively. At first glance there seems to be no variation in the line, but on closer inspection there is a slight downward trend.
ggplot(data = attrition_yes, aes(x = PercentSalaryHike, fill = OverTime)) +
geom_histogram( binwidth = 1, color = "black", alpha = 0.7) +
geom_density(aes(y = ..count..), fill = "transparent", color = "darkblue") +
labs(title = "Histogram with Trend Line of % Salary Hike by Overtime",
x = "Percent Salary Hike", y = "Count") +
  theme_minimal() + theme(legend.position = "top") # Adjust legend position
This is a histogram of Percent Salary Hike, with each bar split and shaded by whether the employee(s) were working overtime. The overall distribution (without the split) seems right skewed, but the shape of the density line suggests it may be multimodal.
ggplot(data = attrition_yes, aes(x=YearsAtCompany, y = PercentSalaryHike, color = PerformanceRating)) + geom_point() + theme_minimal() + ggtitle("Salary Hike v. Years at Company with Performance Rating")
This plot looks at Years at Company and Percent Salary Hike. Those with a higher performance rating seem to have a higher percent salary hike. For a percent salary hike under 20%, the performance rating is under 4 and trends around a rating of 3.
ggplot(data = attrition_yes, aes(y=JobSatisfaction, x = Education)) + geom_point(position="jitter") + theme_minimal() + ggtitle("Job Satisfaction v. Education Categorized by Education Field") + facet_wrap(~EducationField) + geom_smooth(method = lm)
This plot looks at the relationship between Job Satisfaction and Education, categorized by Education Field. Most education fields show a negative correlation between job satisfaction and years of education, except the Medical field, which is the only field where job satisfaction seems to increase as education increases.
Building a Regression Model to Determine Salary
Regression Model 1 Using All Predictors to Determine Salary
Monthly Income is the “salary” variable
class(Attritiondata$MonthlyIncome)
## [1] "integer"
sum(is.na(Attritiondata$MonthlyIncome))
## [1] 0
summary(Attritiondata)
## ID Age Attrition BusinessTravel
## Min. : 1.0 Min. :18.00 Length:870 Length:870
## 1st Qu.:218.2 1st Qu.:30.00 Class :character Class :character
## Median :435.5 Median :35.00 Mode :character Mode :character
## Mean :435.5 Mean :36.83
## 3rd Qu.:652.8 3rd Qu.:43.00
## Max. :870.0 Max. :60.00
## DailyRate Department DistanceFromHome Education
## Min. : 103.0 Length:870 Min. : 1.000 Min. :1.000
## 1st Qu.: 472.5 Class :character 1st Qu.: 2.000 1st Qu.:2.000
## Median : 817.5 Mode :character Median : 7.000 Median :3.000
## Mean : 815.2 Mean : 9.339 Mean :2.901
## 3rd Qu.:1165.8 3rd Qu.:14.000 3rd Qu.:4.000
## Max. :1499.0 Max. :29.000 Max. :5.000
## EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction
## Length:870 Min. :1 Min. : 1.0 Min. :1.000
## Class :character 1st Qu.:1 1st Qu.: 477.2 1st Qu.:2.000
## Mode :character Median :1 Median :1039.0 Median :3.000
## Mean :1 Mean :1029.8 Mean :2.701
## 3rd Qu.:1 3rd Qu.:1561.5 3rd Qu.:4.000
## Max. :1 Max. :2064.0 Max. :4.000
## Gender HourlyRate JobInvolvement JobLevel
## Length:870 Min. : 30.00 Min. :1.000 Min. :1.000
## Class :character 1st Qu.: 48.00 1st Qu.:2.000 1st Qu.:1.000
## Mode :character Median : 66.00 Median :3.000 Median :2.000
## Mean : 65.61 Mean :2.723 Mean :2.039
## 3rd Qu.: 83.00 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :100.00 Max. :4.000 Max. :5.000
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## Length:870 Min. :1.000 Length:870 Min. : 1081
## Class :character 1st Qu.:2.000 Class :character 1st Qu.: 2840
## Mode :character Median :3.000 Mode :character Median : 4946
## Mean :2.709 Mean : 6390
## 3rd Qu.:4.000 3rd Qu.: 8182
## Max. :4.000 Max. :19999
## MonthlyRate NumCompaniesWorked OverTime PercentSalaryHike
## Min. : 2094 Min. :0.000 Length:870 Min. :11.0
## 1st Qu.: 8092 1st Qu.:1.000 Class :character 1st Qu.:12.0
## Median :14074 Median :2.000 Mode :character Median :14.0
## Mean :14326 Mean :2.728 Mean :15.2
## 3rd Qu.:20456 3rd Qu.:4.000 3rd Qu.:18.0
## Max. :26997 Max. :9.000 Max. :25.0
## PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
## Min. :3.000 Min. :1.000 Min. :80 Min. :0.0000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:80 1st Qu.:0.0000
## Median :3.000 Median :3.000 Median :80 Median :1.0000
## Mean :3.152 Mean :2.707 Mean :80 Mean :0.7839
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:80 3rd Qu.:1.0000
## Max. :4.000 Max. :4.000 Max. :80 Max. :3.0000
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## Min. : 0.00 Min. :0.000 Min. :1.000 Min. : 0.000
## 1st Qu.: 6.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000
## Median :10.00 Median :3.000 Median :3.000 Median : 5.000
## Mean :11.05 Mean :2.832 Mean :2.782 Mean : 6.962
## 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:10.000
## Max. :40.00 Max. :6.000 Max. :4.000 Max. :40.000
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 2.00
## Median : 3.000 Median : 1.000 Median : 3.00
## Mean : 4.205 Mean : 2.169 Mean : 4.14
## 3rd Qu.: 7.000 3rd Qu.: 3.000 3rd Qu.: 7.00
## Max. :18.000 Max. :15.000 Max. :17.00
# Identify character variables
char_vars <- sapply(Attritiondata, is.character)
# Convert character variables to factors
Attritiondata[, char_vars] <- lapply(Attritiondata[, char_vars], as.factor)
#Check Factor Levels for Categorical variables:
sapply(Attritiondata[,char_vars],levels)
## $Attrition
## [1] "No" "Yes"
##
## $BusinessTravel
## [1] "Non-Travel" "Travel_Frequently" "Travel_Rarely"
##
## $Department
## [1] "Human Resources" "Research & Development" "Sales"
##
## $EducationField
## [1] "Human Resources" "Life Sciences" "Marketing" "Medical"
## [5] "Other" "Technical Degree"
##
## $Gender
## [1] "Female" "Male"
##
## $JobRole
## [1] "Healthcare Representative" "Human Resources"
## [3] "Laboratory Technician" "Manager"
## [5] "Manufacturing Director" "Research Director"
## [7] "Research Scientist" "Sales Executive"
## [9] "Sales Representative"
##
## $MaritalStatus
## [1] "Divorced" "Married" "Single"
##
## $OverTime
## [1] "No" "Yes"
#Noticed Over18 has only 1 factor level; so going to remove from dataset
# Attritiondata <- subset(Attritiondata, select = -c(Over18))
# Fit the linear regression model with all predictors
Model1_fit <- lm(MonthlyIncome ~ ., data = Attritiondata)
summary(Model1_fit)
##
## Call:
## lm(formula = MonthlyIncome ~ ., data = Attritiondata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3708.8 -674.1 14.7 614.1 4100.0
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.013e+01 7.772e+02 0.077 0.938349
## ID -2.343e-01 1.476e-01 -1.588 0.112713
## Age -1.431e+00 5.649e+00 -0.253 0.800110
## AttritionYes 8.904e+01 1.154e+02 0.771 0.440729
## BusinessTravelTravel_Frequently 1.895e+02 1.420e+02 1.334 0.182441
## BusinessTravelTravel_Rarely 3.720e+02 1.200e+02 3.099 0.002005 **
## DailyRate 1.452e-01 9.129e-02 1.591 0.112062
## DepartmentResearch & Development 1.234e+02 4.768e+02 0.259 0.795866
## DepartmentSales -4.594e+02 4.877e+02 -0.942 0.346580
## DistanceFromHome -6.237e+00 4.578e+00 -1.362 0.173417
## Education -3.743e+01 3.716e+01 -1.007 0.314105
## EducationFieldLife Sciences 1.352e+02 3.692e+02 0.366 0.714248
## EducationFieldMarketing 1.377e+02 3.914e+02 0.352 0.725050
## EducationFieldMedical 3.326e+01 3.699e+02 0.090 0.928376
## EducationFieldOther 9.152e+01 3.946e+02 0.232 0.816664
## EducationFieldTechnical Degree 9.680e+01 3.843e+02 0.252 0.801179
## EmployeeCount NA NA NA NA
## EmployeeNumber 8.681e-02 6.103e-02 1.422 0.155269
## EnvironmentSatisfaction -6.267e+00 3.364e+01 -0.186 0.852252
## GenderMale 1.100e+02 7.442e+01 1.478 0.139715
## HourlyRate -3.591e-01 1.824e+00 -0.197 0.844003
## JobInvolvement 1.677e+01 5.321e+01 0.315 0.752698
## JobLevel 2.783e+03 8.340e+01 33.375 < 2e-16 ***
## JobRoleHuman Resources -2.053e+02 5.157e+02 -0.398 0.690663
## JobRoleLaboratory Technician -5.891e+02 1.714e+02 -3.437 0.000618 ***
## JobRoleManager 4.280e+03 2.830e+02 15.122 < 2e-16 ***
## JobRoleManufacturing Director 1.809e+02 1.696e+02 1.067 0.286497
## JobRoleResearch Director 4.077e+03 2.193e+02 18.592 < 2e-16 ***
## JobRoleResearch Scientist -3.494e+02 1.705e+02 -2.049 0.040790 *
## JobRoleSales Executive 5.263e+02 3.576e+02 1.472 0.141449
## JobRoleSales Representative 8.531e+01 3.918e+02 0.218 0.827703
## JobSatisfaction 3.278e+01 3.344e+01 0.980 0.327278
## MaritalStatusMarried 6.708e+01 1.002e+02 0.669 0.503497
## MaritalStatusSingle 1.128e+01 1.361e+02 0.083 0.933978
## MonthlyRate -9.505e-03 5.143e-03 -1.848 0.064946 .
## NumCompaniesWorked 5.421e+00 1.691e+01 0.321 0.748622
## OverTimeYes -1.394e+01 8.434e+01 -0.165 0.868787
## PercentSalaryHike 2.586e+01 1.581e+01 1.635 0.102351
## PerformanceRating -3.235e+02 1.614e+02 -2.004 0.045368 *
## RelationshipSatisfaction 1.640e+01 3.339e+01 0.491 0.623375
## StandardHours NA NA NA NA
## StockOptionLevel -2.758e+00 5.740e+01 -0.048 0.961695
## TotalWorkingYears 5.080e+01 1.098e+01 4.627 4.3e-06 ***
## TrainingTimesLastYear 2.436e+01 2.912e+01 0.837 0.403111
## WorkLifeBalance -3.472e+01 5.161e+01 -0.673 0.501284
## YearsAtCompany -2.750e+00 1.370e+01 -0.201 0.840925
## YearsInCurrentRole 3.398e+00 1.711e+01 0.199 0.842584
## YearsSinceLastPromotion 3.084e+01 1.532e+01 2.013 0.044405 *
## YearsWithCurrManager -2.691e+01 1.669e+01 -1.613 0.107210
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1055 on 823 degrees of freedom
## Multiple R-squared: 0.9501, Adjusted R-squared: 0.9473
## F-statistic: 340.7 on 46 and 823 DF, p-value: < 2.2e-16
#p val < alpha of .05, it affects the Salary Variable
#Model1_Preds = predict(Model1_fit, newdata = NoSalary) #this is an example of predict function you would want to use
#as.data.frame(Model1_Preds)
#write.csv(Model1_Preds,"Model1PredictionsNoSalaryRenuKarthikeyan.csv")
Looking at the summary of the Model 1 output, the statistically significant predictors are Business Travel, Job Level, Job Role, Performance Rating, Total Working Years, and Years Since Last Promotion. The F-statistic tests the overall significance of the model: an F-statistic of 340.7 with a very small p-value (< 2.2e-16) suggests that at least one predictor variable is significantly related to Monthly Income. Two coefficients are not defined because of singularities, which can indicate multicollinearity, where two or more predictor variables are highly correlated.
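A hedged follow-up on the singularity note: per the summary above, EmployeeCount and StandardHours never vary in this data, which yields NA coefficients, and collinear predictors can be screened with variance inflation factors from the car package (already loaded). A sketch using mtcars so it runs on its own:

```r
library(car)  # vif()

# 1) Find constant columns, which produce NA coefficients ("singularities");
#    StandardHours here mimics the constant column in the attrition data
df <- data.frame(mpg = mtcars$mpg, wt = mtcars$wt, StandardHours = 80)
constant_cols <- names(Filter(function(col) length(unique(col)) <= 1, df))
constant_cols  # identifies "StandardHours"

# 2) Screen remaining predictors for multicollinearity via VIF
fit <- lm(mpg ~ disp + hp + wt, data = mtcars)
vif(fit)       # values well above ~5-10 suggest problematic multicollinearity
```

Dropping the flagged columns before refitting would remove the NA coefficients without changing the fitted values.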
Linear Regression using Select predictors to Determine Salary
Model2_fit = lm(MonthlyIncome ~ NumCompaniesWorked + Age + Gender + MaritalStatus + JobInvolvement + JobRole + DistanceFromHome + JobLevel + Education, data = Attritiondata)
summary(Model2_fit) # P value overall implies that at least one of my variables' slope != 0.
##
## Call:
## lm(formula = MonthlyIncome ~ NumCompaniesWorked + Age + Gender +
## MaritalStatus + JobInvolvement + JobRole + DistanceFromHome +
## JobLevel + Education, data = Attritiondata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3536.4 -699.4 -39.1 659.8 4178.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -600.926 320.383 -1.876 0.06105 .
## NumCompaniesWorked 17.866 15.366 1.163 0.24526
## Age 12.875 4.962 2.595 0.00963 **
## GenderMale 120.277 75.308 1.597 0.11061
## MaritalStatusMarried 106.159 95.406 1.113 0.26615
## MaritalStatusSingle 23.451 103.944 0.226 0.82155
## JobInvolvement 15.469 52.613 0.294 0.76882
## JobRoleHuman Resources -327.849 255.950 -1.281 0.20057
## JobRoleLaboratory Technician -534.026 172.285 -3.100 0.00200 **
## JobRoleManager 3933.510 232.847 16.893 < 2e-16 ***
## JobRoleManufacturing Director 92.825 169.877 0.546 0.58492
## JobRoleResearch Director 3919.755 218.518 17.938 < 2e-16 ***
## JobRoleResearch Scientist -246.097 171.976 -1.431 0.15280
## JobRoleSales Executive -122.894 146.575 -0.838 0.40202
## JobRoleSales Representative -392.881 217.075 -1.810 0.07067 .
## DistanceFromHome -7.908 4.553 -1.737 0.08281 .
## JobLevel 3042.789 69.854 43.559 < 2e-16 ***
## Education -33.713 37.379 -0.902 0.36736
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1078 on 852 degrees of freedom
## Multiple R-squared: 0.9461, Adjusted R-squared: 0.945
## F-statistic: 878.9 on 17 and 852 DF, p-value: < 2.2e-16
NoSalary$MonthlyIncome = predict(Model2_fit, newdata = NoSalary)
#Model2_Preds<- NoSalary %>% select(c("ID","MonthlyIncome"))
#as.data.frame(Model2_Preds)
#write.csv(Model2_Preds,"Model2PredictionsNoSalaryRenuKarthikeyan.csv", row.names = T)
These predictors were chosen based on insights from the EDA, where they appeared to be meaningful drivers of salary. The coefficient for JobLevel is 3042.789, suggesting that, on average, a one-unit increase in JobLevel is associated with an increase of 3042.789 dollars in MonthlyIncome. Likewise, Age has a coefficient of 12.875, indicating that each additional year of age is associated with about 12.88 dollars of additional monthly income, holding all other variables constant. GenderMale has a coefficient of about 120, indicating that, holding all other variables constant, men earn roughly 120 dollars more per month than women. The coefficient for DistanceFromHome is -7.908, indicating that each additional mile from home is associated with a monthly income decrease of about 7.91 dollars.
The statistically significant predictors (p < .10) are Age, Job Level, and certain Job Roles such as Laboratory Technician, Manager, and Research Director.
Overall, Model 2 has a high R-squared value, indicating a good fit to the data. Many predictors are statistically significant, suggesting they contribute to determining Monthly Income.
Split and Train Linear Regression Models to Predict Salary, with average of cross validation
set.seed(1234)
TrainObs = sample(seq(1,dim(Attritiondata)[1]),round(.8*dim(Attritiondata)[1]),replace = FALSE)
SalaryTrain = Attritiondata[TrainObs,]
head(SalaryTrain,5)
## ID Age Attrition BusinessTravel DailyRate Department
## 284 284 31 No Travel_Rarely 691 Sales
## 848 848 39 No Travel_Rarely 1132 Research & Development
## 101 101 27 No Travel_Rarely 1377 Research & Development
## 623 623 34 No Travel_Rarely 182 Research & Development
## 645 645 41 Yes Non-Travel 906 Research & Development
## DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 284 7 3 Marketing 1 438
## 848 1 3 Medical 1 417
## 101 11 1 Life Sciences 1 1434
## 623 1 4 Life Sciences 1 797
## 645 5 2 Life Sciences 1 1210
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 284 4 Male 73 3 2
## 848 3 Male 48 4 3
## 101 2 Male 91 3 1
## 623 2 Female 72 4 1
## 645 1 Male 95 2 1
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 284 Sales Executive 4 Divorced 7547
## 848 Healthcare Representative 4 Divorced 9613
## 101 Laboratory Technician 1 Married 2099
## 623 Research Scientist 4 Single 3280
## 645 Research Scientist 1 Divorced 2107
## MonthlyRate NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating
## 284 7143 4 No 12 3
## 848 10942 0 No 17 3
## 101 7679 0 No 14 3
## 623 13551 2 No 16 3
## 645 20293 6 No 17 3
## RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 284 4 80 3 13
## 848 1 80 3 19
## 101 2 80 0 6
## 623 3 80 0 10
## 645 1 80 1 5
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 284 3 3 7 7
## 848 5 2 18 10
## 101 3 4 5 0
## 623 2 3 4 2
## 645 2 1 1 0
## YearsSinceLastPromotion YearsWithCurrManager
## 284 1 7
## 848 3 7
## 101 1 4
## 623 1 3
## 645 0 0
SalaryTest = Attritiondata[-TrainObs,]
head(SalaryTest,5)
## ID Age Attrition BusinessTravel DailyRate Department
## 5 5 24 No Travel_Frequently 567 Research & Development
## 8 8 37 No Travel_Rarely 309 Sales
## 9 9 34 No Travel_Rarely 1333 Sales
## 16 16 31 No Non-Travel 1188 Sales
## 18 18 46 No Non-Travel 1144 Research & Development
## DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 5 2 1 Technical Degree 1 1646
## 8 10 4 Life Sciences 1 1105
## 9 10 4 Life Sciences 1 1055
## 16 20 2 Marketing 1 947
## 18 7 4 Medical 1 487
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 5 1 Female 32 3 1
## 8 4 Female 88 2 2
## 9 3 Female 87 3 1
## 16 4 Female 45 3 2
## 18 3 Female 30 3 2
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 5 Research Scientist 4 Single 3760
## 8 Sales Executive 4 Divorced 6694
## 9 Sales Representative 3 Married 2220
## 16 Sales Executive 3 Married 6932
## 18 Manufacturing Director 3 Married 5258
## MonthlyRate NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating
## 5 17218 1 Yes 13 3
## 8 24223 2 Yes 14 3
## 9 18410 1 Yes 19 3
## 16 24406 1 No 13 3
## 18 16044 2 No 14 3
## RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 5 3 80 0 6
## 8 3 80 3 8
## 9 4 80 1 1
## 16 4 80 1 9
## 18 3 80 0 7
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 5 2 3 6 3
## 8 5 3 1 0
## 9 2 3 1 1
## 16 2 2 9 8
## 18 2 4 1 0
## YearsSinceLastPromotion YearsWithCurrManager
## 5 1 3
## 8 0 0
## 9 0 0
## 16 0 0
## 18 0 0
Model1_fit <- lm(MonthlyIncome ~ ., data = Attritiondata)
summary(Model1_fit) ##
## Call:
## lm(formula = MonthlyIncome ~ ., data = Attritiondata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3708.8 -674.1 14.7 614.1 4100.0
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.013e+01 7.772e+02 0.077 0.938349
## ID -2.343e-01 1.476e-01 -1.588 0.112713
## Age -1.431e+00 5.649e+00 -0.253 0.800110
## AttritionYes 8.904e+01 1.154e+02 0.771 0.440729
## BusinessTravelTravel_Frequently 1.895e+02 1.420e+02 1.334 0.182441
## BusinessTravelTravel_Rarely 3.720e+02 1.200e+02 3.099 0.002005 **
## DailyRate 1.452e-01 9.129e-02 1.591 0.112062
## DepartmentResearch & Development 1.234e+02 4.768e+02 0.259 0.795866
## DepartmentSales -4.594e+02 4.877e+02 -0.942 0.346580
## DistanceFromHome -6.237e+00 4.578e+00 -1.362 0.173417
## Education -3.743e+01 3.716e+01 -1.007 0.314105
## EducationFieldLife Sciences 1.352e+02 3.692e+02 0.366 0.714248
## EducationFieldMarketing 1.377e+02 3.914e+02 0.352 0.725050
## EducationFieldMedical 3.326e+01 3.699e+02 0.090 0.928376
## EducationFieldOther 9.152e+01 3.946e+02 0.232 0.816664
## EducationFieldTechnical Degree 9.680e+01 3.843e+02 0.252 0.801179
## EmployeeCount NA NA NA NA
## EmployeeNumber 8.681e-02 6.103e-02 1.422 0.155269
## EnvironmentSatisfaction -6.267e+00 3.364e+01 -0.186 0.852252
## GenderMale 1.100e+02 7.442e+01 1.478 0.139715
## HourlyRate -3.591e-01 1.824e+00 -0.197 0.844003
## JobInvolvement 1.677e+01 5.321e+01 0.315 0.752698
## JobLevel 2.783e+03 8.340e+01 33.375 < 2e-16 ***
## JobRoleHuman Resources -2.053e+02 5.157e+02 -0.398 0.690663
## JobRoleLaboratory Technician -5.891e+02 1.714e+02 -3.437 0.000618 ***
## JobRoleManager 4.280e+03 2.830e+02 15.122 < 2e-16 ***
## JobRoleManufacturing Director 1.809e+02 1.696e+02 1.067 0.286497
## JobRoleResearch Director 4.077e+03 2.193e+02 18.592 < 2e-16 ***
## JobRoleResearch Scientist -3.494e+02 1.705e+02 -2.049 0.040790 *
## JobRoleSales Executive 5.263e+02 3.576e+02 1.472 0.141449
## JobRoleSales Representative 8.531e+01 3.918e+02 0.218 0.827703
## JobSatisfaction 3.278e+01 3.344e+01 0.980 0.327278
## MaritalStatusMarried 6.708e+01 1.002e+02 0.669 0.503497
## MaritalStatusSingle 1.128e+01 1.361e+02 0.083 0.933978
## MonthlyRate -9.505e-03 5.143e-03 -1.848 0.064946 .
## NumCompaniesWorked 5.421e+00 1.691e+01 0.321 0.748622
## OverTimeYes -1.394e+01 8.434e+01 -0.165 0.868787
## PercentSalaryHike 2.586e+01 1.581e+01 1.635 0.102351
## PerformanceRating -3.235e+02 1.614e+02 -2.004 0.045368 *
## RelationshipSatisfaction 1.640e+01 3.339e+01 0.491 0.623375
## StandardHours NA NA NA NA
## StockOptionLevel -2.758e+00 5.740e+01 -0.048 0.961695
## TotalWorkingYears 5.080e+01 1.098e+01 4.627 4.3e-06 ***
## TrainingTimesLastYear 2.436e+01 2.912e+01 0.837 0.403111
## WorkLifeBalance -3.472e+01 5.161e+01 -0.673 0.501284
## YearsAtCompany -2.750e+00 1.370e+01 -0.201 0.840925
## YearsInCurrentRole 3.398e+00 1.711e+01 0.199 0.842584
## YearsSinceLastPromotion 3.084e+01 1.532e+01 2.013 0.044405 *
## YearsWithCurrManager -2.691e+01 1.669e+01 -1.613 0.107210
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1055 on 823 degrees of freedom
## Multiple R-squared: 0.9501, Adjusted R-squared: 0.9473
## F-statistic: 340.7 on 46 and 823 DF, p-value: < 2.2e-16
#Model1_Preds = predict(Model1_fit, newdata = NoSalary)
#as.data.frame(Model1_Preds)
#write.csv(Model1_Preds,"Model1PredictionsNoSalaryRenuKarthikeyan.csv")
#Cross Validation and Mean Square Predictor Error Calculation
numMSPEs = 1000
MSPEHolderModel1 = numeric(numMSPEs)
MSPEHolderModel2 = numeric(numMSPEs)
RMSEHolderModel1 = numeric(numMSPEs)
RMSEHolderModel2 = numeric(numMSPEs)
for (i in 1:numMSPEs)
{
TrainObs = sample(seq(1,dim(Attritiondata)[1]),round(.8*dim(Attritiondata)[1]),replace = FALSE)
SalaryTrain = Attritiondata[TrainObs,]
head(SalaryTrain,5)
SalaryTest = Attritiondata[-TrainObs,]
head(SalaryTest,5)
Model1_fit <- lm(MonthlyIncome ~ ., data = SalaryTrain)
Model1_Preds = predict(Model1_fit, newdata = SalaryTest)
#MSPE Model 1
MSPE = mean((SalaryTest$MonthlyIncome - Model1_Preds)^2)
MSPE
MSPEHolderModel1[i] = MSPE
RMSEHolderModel1[i] = sqrt(MSPE)
#Model 2
Model2_fit = lm(MonthlyIncome ~ NumCompaniesWorked + Age + Gender + MaritalStatus + JobInvolvement + JobRole + DistanceFromHome + JobLevel + Education, data = SalaryTrain)
Model2_Preds = predict(Model2_fit,newdata = SalaryTest)
MSPE = mean((SalaryTest$MonthlyIncome - Model2_Preds)^2)
MSPE
MSPEHolderModel2[i] = MSPE
RMSEHolderModel2[i] = sqrt(MSPE)
}
mean(MSPEHolderModel1) ## [1] 1196649
mean(MSPEHolderModel2) ## [1] 1194877
mean(RMSEHolderModel1) ## [1] 1091.936
mean(RMSEHolderModel2) ## [1] 1091.156
AIC(Model1_fit) ## [1] 11699.08
AIC(Model2_fit) ## [1] 11698.73
- Data was split 80/20 for training and testing
- Ran both model 1 and 2 to predict Monthly Income on the testing set
- Cross Validation Process with 1000 iterations where MSPE and RMSE are calculated for both models
- Summary Statistics of mean MSPE and mean RMSE for both models
Conclusion: Model 2 is the better fit, as it has a lower mean MSPE, a lower mean RMSE, and a slightly lower AIC.
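Because both models are scored on the same 1000 random splits, a paired comparison (a sketch using the RMSE holder vectors created in the loop above; not part of the original analysis) can show whether the small RMSE gap is consistent across splits rather than noise:

```r
# Paired t-test on per-split RMSE: a small p-value would indicate the
# Model 2 advantage holds consistently across the 1000 splits.
t.test(RMSEHolderModel1, RMSEHolderModel2, paired = TRUE)
summary(RMSEHolderModel1 - RMSEHolderModel2)  # distribution of per-split differences
```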
KNN and Naive Bayes
When I initially attempted to use all predictors for KNN, the model would repeatedly fail with "Warning: NAs introduced by coercion" followed by "Error in knn(train_predictors, test_predictors, response_train, prob = TRUE,: NA/NaN/Inf in foreign function call (arg 6)", even though there are no missing values in the Attritiondata data set. I later realized that NAs were being introduced when the categorical variables were coerced to numeric in the KNN chunk. I created dummy columns for these categorical variables so that KNN could run without errors. Below is the code used to create the dummy columns with the fastDummies package in R.
Preparation of the Data - Including Dummy Columns for Categorical Variables
Attritiondata$BusinessTravel<- as.factor(Attritiondata$BusinessTravel)
Attritiondata$Department<- as.factor(Attritiondata$Department)
Attritiondata$EducationField<- as.factor(Attritiondata$EducationField)
Attritiondata$Gender<- as.factor(Attritiondata$Gender)
Attritiondata$JobRole<- as.factor(Attritiondata$JobRole)
Attritiondata$MaritalStatus<- as.factor(Attritiondata$MaritalStatus)
Attritiondata$OverTime<- as.factor(Attritiondata$OverTime)
Attritiondata<-dummy_cols(Attritiondata,select_columns=c("BusinessTravel","MaritalStatus","JobRole","Department","EducationField","OverTime","Gender"))
Attritiondata <- Attritiondata %>% select(-c("BusinessTravel","MaritalStatus","JobRole","Department","EducationField","OverTime","Gender"))
#Doing Same for Attrition Test Data Set (AttritionTest)
AttritionTest$BusinessTravel<- as.factor(AttritionTest$BusinessTravel)
AttritionTest$Department<- as.factor(AttritionTest$Department)
AttritionTest$EducationField<- as.factor(AttritionTest$EducationField)
AttritionTest$Gender<- as.factor(AttritionTest$Gender)
AttritionTest$JobRole<- as.factor(AttritionTest$JobRole)
AttritionTest$MaritalStatus<- as.factor(AttritionTest$MaritalStatus)
AttritionTest$OverTime<- as.factor(AttritionTest$OverTime)
AttritionTest<-dummy_cols(AttritionTest,select_columns=c("BusinessTravel","MaritalStatus","JobRole","Department","EducationField","OverTime","Gender"))
AttritionTest <- AttritionTest %>% select(-c("BusinessTravel","MaritalStatus","JobRole","Department","EducationField","OverTime","Gender"))
diff_columns_df1 <- setdiff(names(Attritiondata), names(AttritionTest))
cat("Columns in Attritiondata but not in AttritionTest:", paste(diff_columns_df1, collapse = ", "), "\n") ## Columns in Attritiondata but not in AttritionTest: Attrition
KNN Model and Confusion Matrix - Training with All predictors
set.seed(1234)
iterations <- 100
numks <- 10
splitPerc <- 0.8
masterAcc <- matrix(nrow = iterations, ncol = numks)
for (j in 1:iterations) {
trainIndices <- sample(1:dim(Attritiondata)[1], round(splitPerc * dim(Attritiondata)[1]))
train <- as.data.frame(Attritiondata[trainIndices, ])
test <- as.data.frame(Attritiondata[-trainIndices, ])
response_variable <- "Attrition"
response_train <- factor(train[[response_variable]])
response_test <- factor(test[[response_variable]])
# Select columns for predictors
selected_columns <- c(1, 2, 4:36) # Adjust this range as needed
#selected_columns <- c("ID","Age","BusinessTravel","DailyRate","Department", "DistanceFromHome", "Education", "EducationField", "EmployeeCount", "EmployeeNumber","EnvironmentSatisfaction", "Gender", "HourlyRate", "JobInvolvement", "JobLevel", "JobRole", "JobSatisfaction", "MaritalStatus", "MonthlyIncome", "MonthlyRate", "NumCompaniesWorked", "OverTime", "PercentSalaryHike", "PerformanceRating", "RelationshipSatisfaction", "StandardHours", "StockOptionLevel", "TotalWorkingYears", "TrainingTimesLastYear", "WorkLifeBalance", "YearsAtCompany", "YearsInCurrentRole", "YearsSinceLastPromotion", "YearsWithCurrManager")
# Extract the selected columns
train_predictors <- train[, selected_columns, drop = FALSE]
test_predictors <- test[, selected_columns, drop = FALSE]
train_predictors <- apply(train_predictors, 2, as.numeric)
test_predictors <- apply(test_predictors, 2, as.numeric)
train_predictors <- scale(train_predictors)
test_predictors <- scale(test_predictors)
# Convert to numeric matrices
train_predictors <- as.matrix(train_predictors)
test_predictors <- as.matrix(test_predictors)
# Remove infinite values
train_predictors[!is.finite(train_predictors)] <- 0
test_predictors[!is.finite(test_predictors)] <- 0
if (sum(response_train == "Yes") > 0 && sum(response_test == "Yes") > 0) {
for (i in 1:numks) {
classifications <- knn(train_predictors, test_predictors, response_train, prob = TRUE, k = i)
table(classifications, response_test)
CM_AllK<- confusionMatrix(table(classifications, response_test), positive = "Yes")
masterAcc[j, i] <- CM_AllK$overall[1]
}
}
}
CM_AllK ## Confusion Matrix and Statistics
##
## response_test
## classifications No Yes
## No 143 26
## Yes 2 3
##
## Accuracy : 0.8391
## 95% CI : (0.7759, 0.8903)
## No Information Rate : 0.8333
## P-Value [Acc > NIR] : 0.4685
##
## Kappa : 0.134
##
## Mcnemar's Test P-Value : 1.383e-05
##
## Sensitivity : 0.10345
## Specificity : 0.98621
## Pos Pred Value : 0.60000
## Neg Pred Value : 0.84615
## Prevalence : 0.16667
## Detection Rate : 0.01724
## Detection Prevalence : 0.02874
## Balanced Accuracy : 0.54483
##
## 'Positive' Class : Yes
##
MeanAcc = colMeans(masterAcc); MeanAcc ## [1] 0.7946552 0.7893103 0.8301724 0.8294828 0.8386782 0.8394828 0.8433333
## [8] 0.8406897 0.8425287 0.8426437
plot(seq(1, numks, 1), MeanAcc, type = "l", ylab = "Mean Accuracy (Positive Class: Yes)")
which.max(MeanAcc) ## [1] 7
max(MeanAcc) ## [1] 0.8433333
From the plot, we see that the best k is k = 7. The overall accuracy is 83.91%, but sensitivity (the true positive rate) is low at 10.35%. The model is better at correctly predicting the majority class ("No") but struggles with the minority class ("Yes").
Looking specifically at the confusion matrix statistics, the interpretation of each is:
- Sensitivity (true positive rate): 0.10345 — the proportion of employees who actually attrited that the model correctly identified.
- Specificity (true negative rate): 0.98621 — the proportion of employees who did not leave the company that the model correctly identified.
- Positive predictive value (precision): 0.60000 — the proportion of predicted attriters who truly attrited.
- Negative predictive value: 0.84615 — the proportion of predicted stayers who truly stayed.
- Prevalence: 0.16667 — the proportion of actual attriters in the overall dataset.
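These metrics can be recomputed by hand from the counts in the confusion matrix above (143, 26, 2, 3, with "Yes" as the positive class); this quick sketch, not part of the original analysis, confirms the caret output:

```r
# Counts from the table above: rows = predicted class, columns = actual class.
TP <- 3; FN <- 26; FP <- 2; TN <- 143
TP / (TP + FN)   # sensitivity = 3/29   ~ 0.10345
TN / (TN + FP)   # specificity = 143/145 ~ 0.98621
TP / (TP + FP)   # precision   = 3/5    = 0.60
TN / (TN + FN)   # NPV         = 143/169 ~ 0.84615
```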
Trying a Threshold Change for KNN - All Predictors; k = 7
set.seed(1234)
iterations<- 100
accuracy_table <- numeric(iterations)
trainIndices <- sample(1:dim(Attritiondata)[1], round(splitPerc * dim(Attritiondata)[1]))
train <- as.data.frame(Attritiondata[trainIndices, ])
test <- as.data.frame(Attritiondata[-trainIndices, ])
train_features<- train[, -which(names(train) == "Attrition")]
test_features<- test[, -which(names(test) == "Attrition")]
train_target<- train$Attrition
# train_scaled<- scale(train_features) Using this in the knn returns "No missing values allowed"
#test_scaled<- scale(test_features)
for (i in 1:iterations) {
classifications <- knn(train_features, test_features, train_target, prob = TRUE, k = 7)
table(classifications, response_test)
CM_AllK7 <- confusionMatrix(table(as.factor(test$Attrition),classifications), positive ="Yes")
accuracy_table[i] <- CM_AllK7$overall[1]
}
#print(accuracy_table)
avg_accuracy <- mean(accuracy_table) # average across all iterations
avg_accuracy ## [1] 0.7988506
specificity_table<- numeric(iterations)
sensitivity_table<- numeric(iterations)
for (i in 1:iterations) {
classifications <- knn(train_features, test_features, train_target, prob = TRUE, k = 7)
table(classifications, response_test)
CM_AllK7 <- confusionMatrix(table(as.factor(test$Attrition),classifications), positive ="Yes")
specificity_table[i] <- CM_AllK7$byClass['Specificity']
sensitivity_table[i] <- CM_AllK7$byClass['Sensitivity']
}
CM_AllK7 ## Confusion Matrix and Statistics
##
## classifications
## No Yes
## No 137 3
## Yes 32 2
##
## Accuracy : 0.7989
## 95% CI : (0.7315, 0.8557)
## No Information Rate : 0.9713
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0552
##
## Mcnemar's Test P-Value : 2.214e-06
##
## Sensitivity : 0.40000
## Specificity : 0.81065
## Pos Pred Value : 0.05882
## Neg Pred Value : 0.97857
## Prevalence : 0.02874
## Detection Rate : 0.01149
## Detection Prevalence : 0.19540
## Balanced Accuracy : 0.60533
##
## 'Positive' Class : Yes
##
avg_specificity <- mean(specificity_table)
avg_specificity ## [1] 0.8106509
avg_sensitivity <- mean(sensitivity_table)
avg_sensitivity ## [1] 0.4
###### New Threshold using classifications which used k = 7 from above
#classifications
#attributes(classifications) # Look at possible attributes
#attributes(classifications)$prob # Probability of what was classified for that observation
probs = ifelse(classifications == "Yes",attributes(classifications)$prob, 1- attributes(classifications)$prob)
summary(Attritiondata$Attrition) ## No Yes
## 730 140
140/(730+140) # 16.09% ## [1] 0.1609195
NewClass = ifelse(probs > .1609, "Yes", "No")
NewClass <- factor(NewClass, levels = levels(response_test))
table(NewClass,response_test) ## response_test
## NewClass No Yes
## No 115 18
## Yes 30 11
CM_NewThreshold <- confusionMatrix(table(NewClass, response_test), positive = "Yes", mode = "everything")
CM_NewThreshold ## Confusion Matrix and Statistics
##
## response_test
## NewClass No Yes
## No 115 18
## Yes 30 11
##
## Accuracy : 0.7241
## 95% CI : (0.6514, 0.7891)
## No Information Rate : 0.8333
## P-Value [Acc > NIR] : 0.9999
##
## Kappa : 0.1479
##
## Mcnemar's Test P-Value : 0.1124
##
## Sensitivity : 0.37931
## Specificity : 0.79310
## Pos Pred Value : 0.26829
## Neg Pred Value : 0.86466
## Precision : 0.26829
## Recall : 0.37931
## F1 : 0.31429
## Prevalence : 0.16667
## Detection Rate : 0.06322
## Detection Prevalence : 0.23563
## Balanced Accuracy : 0.58621
##
## 'Positive' Class : Yes
##
Similar to before, the loop ran for 100 iterations, but k was set to 7 for knn. There are two confusion matrices: one without the threshold change and one with it. With the threshold change, accuracy went down, sensitivity was reduced considerably, and specificity dropped by roughly 10%. The positive predictive value increased by 9%, and the negative predictive value fell by 14%.
The new threshold has lower accuracy than the original kNN classification. Sensitivity is significantly lower for the new threshold, indicating that fewer true positives are captured. Specificity is slightly lower, indicating a decrease in correctly identified true negatives. Precision is lower as well, reflecting less accurate positive predictions. Overall, the original KNN classification has higher accuracy and sensitivity than the new threshold.
#Applying to the Test Model to predict attrition using KNN with all predictors (This is best model)
train_features<- Attritiondata[, -which(names(Attritiondata) == "Attrition")]
test_features<- AttritionTest # the competition set has no Attrition column to drop
#head(test_features,5)
train_target <- Attritiondata$Attrition
AttritionClassifications <- knn(train_features, test_features, train_target, prob = TRUE, k = 7)
AttritionTest$Attrition <- AttritionClassifications
#head(AttritionTest,5)
AttritionPredictionsKNN<-AttritionTest%>%select(c("ID","Attrition"))
#head(AttritionPredictionsKNN,5)
write.csv(AttritionPredictionsKNN,"CaseStudy2AttritionPredictionsKNN_RenuKarthikeyan.csv", row.names = T)
KNN Using Select Predictors
After doing my exploratory data analysis, I believe the important predictors for attrition are: Gender, Department, Job Satisfaction, Distance From Home, Monthly Income, Job Level, Age, Over Time, Percent Salary Hike, Performance Rating, and Education.
set.seed(1234)
iterations = 100
numks = 10
splitPerc = .8
masterAcc = matrix(nrow = iterations, ncol = numks)
for (j in 1:iterations) {
trainIndices <- sample(1:dim(Attritiondata)[1], round(splitPerc * dim(Attritiondata)[1]))
train <- as.data.frame(Attritiondata[trainIndices, ])
test <- as.data.frame(Attritiondata[-trainIndices, ])
response_train <- factor(train$Attrition)
response_test <- factor(test$Attrition)
#UPDATE HERE!!!!
#train_predictors <- train[, c(2,6,7,8,13,16,18,20,24,25,26)]
#test_predictors <- test[, c(2,6,7,8,13,16,18,20,24,25,26) ]
selected_cols <- c(2, 6, 7, 8, 13, 16, 18, 20, 24, 25, 26)
train_predictors <- train[, selected_cols] #subset function
test_predictors <- test[, selected_cols]
train_predictors <- apply(train_predictors, 2, as.numeric)
test_predictors <- apply(test_predictors, 2, as.numeric)
train_predictors <- scale(train_predictors)
test_predictors <- scale(test_predictors)
train_predictors <- as.matrix(train_predictors)
test_predictors <- as.matrix(test_predictors)
# Remove infinite values
train_predictors[!is.finite(train_predictors)] <- 0
test_predictors[!is.finite(test_predictors)] <- 0
for (i in 1:numks) {
classifications <- knn(train_predictors, test_predictors, response_train, prob =TRUE, k = i)
table(classifications, response_test)
CM_Select<- confusionMatrix(table(classifications, response_test), positive = "Yes")
masterAcc[j, i] <- CM_Select$overall[1]
}
}
CM_Select ## Confusion Matrix and Statistics
##
## response_test
## classifications No Yes
## No 144 24
## Yes 4 2
##
## Accuracy : 0.8391
## 95% CI : (0.7759, 0.8903)
## No Information Rate : 0.8506
## P-Value [Acc > NIR] : 0.7085939
##
## Kappa : 0.0731
##
## Mcnemar's Test P-Value : 0.0003298
##
## Sensitivity : 0.07692
## Specificity : 0.97297
## Pos Pred Value : 0.33333
## Neg Pred Value : 0.85714
## Prevalence : 0.14943
## Detection Rate : 0.01149
## Detection Prevalence : 0.03448
## Balanced Accuracy : 0.52495
##
## 'Positive' Class : Yes
##
MeanAcc = colMeans(masterAcc); MeanAcc ## [1] 0.7545977 0.7523563 0.8077586 0.8095977 0.8272414 0.8263793 0.8315517
## [8] 0.8305172 0.8338506 0.8359195
plot(seq(1, numks, 1), MeanAcc, type = "l", ylab = "Mean Accuracy (Positive Class: Yes)")
which.max(MeanAcc) ## [1] 10
max(MeanAcc) ## [1] 0.8359195
We see that the best k is k = 10. The confusion matrix shown is from the final iteration of the loop. Here, the accuracy is 83.91%, similar to the initial confusion matrix seen for all predictors using KNN. The sensitivity is quite low at 7.69%, suggesting that the model struggles to correctly identify positive (true attrition) instances. The model shows high specificity (97.30%), indicating a decent ability to correctly identify negative instances (no attrition). The positive predictive value (precision) is 33.33%, indicating that among instances predicted as positive, about one-third are true positives.
Trying k = 10 for KNN - Select Predictors
iterations<- 100
accuracy_table <- numeric(iterations)
trainIndices <- sample(1:dim(Attritiondata)[1], round(splitPerc * dim(Attritiondata)[1]))
train <- as.data.frame(Attritiondata[trainIndices, ])
test <- as.data.frame(Attritiondata[-trainIndices, ])
train_features<- train[, selected_cols]
test_features<- test[, selected_cols]
train_target<- train$Attrition
train_scaled<- scale(train_features) # note: the knn call below uses the unscaled features
test_scaled<- scale(test_features)
for (i in 1:iterations) {
classifications <- knn(train_features, test_features, train_target, prob = TRUE, k = 10)
table(classifications, response_test)
CM_SelectK <- confusionMatrix(table(as.factor(test$Attrition),classifications), positive ="Yes")
accuracy_table[i] <- CM_SelectK$overall[1]
}
#print(accuracy_table)
avg_accuracy <- mean(accuracy_table); avg_accuracy ## [1] 0.8275862
specificity_table<- numeric(iterations)
sensitivity_table<- numeric(iterations)
for (i in 1:iterations) {
classifications <- knn(train_features, test_features, train_target, prob = TRUE, k = 10)
table(classifications, response_test)
CM_SelectK<- confusionMatrix(table(as.factor(test$Attrition),classifications), positive ="Yes")
specificity_table[i] <- CM_SelectK$byClass['Specificity']
sensitivity_table[i] <- CM_SelectK$byClass['Sensitivity']
}
CM_SelectK ## Confusion Matrix and Statistics
##
## classifications
## No Yes
## No 145 0
## Yes 29 0
##
## Accuracy : 0.8333
## 95% CI : (0.7695, 0.8854)
## No Information Rate : 1
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.999e-07
##
## Sensitivity : NA
## Specificity : 0.8333
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : 0.0000
## Detection Rate : 0.0000
## Detection Prevalence : 0.1667
## Balanced Accuracy : NA
##
## 'Positive' Class : Yes
##
avg_specificity <- mean(specificity_table)
avg_specificity ## [1] 0.8323699
avg_sensitivity <- mean(sensitivity_table)
avg_sensitivity ## [1] 0
We see that the average specificity is 83.24% and the sensitivity is 0%. This model's performance is notably poor: with zero sensitivity, it fails to identify any positive instances. The specificity and negative predictive value are relatively high, but the complete lack of sensitivity is a serious limitation. This suggests the model needs further refinement, or a different approach, to address the class imbalance and improve its ability to classify positive instances correctly.
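One possible refinement for the zero-sensitivity problem is rebalancing the training data before fitting. This is a sketch, not part of the original analysis, using caret::upSample with the train and selected_cols objects from the chunk above:

```r
# Up-sample the minority ("Yes") class so both classes are equally represented,
# then train KNN on the balanced set as before.
train_bal <- upSample(x = train[, selected_cols],
                      y = factor(train$Attrition),
                      yname = "Attrition")
table(train_bal$Attrition)  # both classes now have the same count
```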
Naive Bayes and Confusion Matrix - Training with All predictors
set.seed(1234)
iterations = 100
masterAcc = matrix(nrow = iterations)
splitPerc = .8 #Training / Test split Percentage
for(j in 1:iterations)
{
trainIndices <- sample(1:dim(Attritiondata)[1], round(splitPerc * dim(Attritiondata)[1]))
train <- as.data.frame(Attritiondata[trainIndices, ])
test <- as.data.frame(Attritiondata[-trainIndices, ])
train$Attrition <- factor(train$Attrition, levels = c("Yes", "No"))
test$Attrition <- factor(test$Attrition, levels = c("Yes", "No"))
model <- naiveBayes(train[, -3], as.factor(train$Attrition), laplace = 1)
predictions <- predict(model, test[, -3])
confMatrix <- table(predictions, as.factor(test$Attrition))
CM_NB_All <- confusionMatrix(confMatrix)
masterAcc[j] <- CM_NB_All$overall[1]
}
CM_NB_All ## Confusion Matrix and Statistics
##
##
## predictions Yes No
## Yes 23 92
## No 2 57
##
## Accuracy : 0.4598
## 95% CI : (0.3841, 0.5368)
## No Information Rate : 0.8563
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1211
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9200
## Specificity : 0.3826
## Pos Pred Value : 0.2000
## Neg Pred Value : 0.9661
## Prevalence : 0.1437
## Detection Rate : 0.1322
## Detection Prevalence : 0.6609
## Balanced Accuracy : 0.6513
##
## 'Positive' Class : Yes
##
MeanAcc = colMeans(masterAcc); MeanAcc ## [1] 0.6482184
Given all the predictors, the sensitivity rate is the percentage of actual attrition cases correctly identified; it measures the model's ability to capture employees who are truly at risk of attrition. This model shows imbalanced performance: high sensitivity (92%) but low specificity. It performs well at identifying actual positive instances (attrition) but struggles to correctly identify negative instances (no attrition). The positive predictive value is relatively low, indicating that when the model predicts attrition, it is correct only 20% of the time.
Naive Bayes and Confusion Matrix - Training; Using Select predictors to determine Attrition
set.seed(1234)
iterations = 100
masterAcc = matrix(nrow = iterations)
splitPerc = .8 #Training / Test split Percentage
for(j in 1:iterations)
{
trainIndices <- sample(1:dim(Attritiondata)[1], round(splitPerc * dim(Attritiondata)[1]))
train <- as.data.frame(Attritiondata[trainIndices, ])
test <- as.data.frame(Attritiondata[-trainIndices, ])
train$Attrition <- factor(train$Attrition, levels = c("Yes", "No"))
test$Attrition <- factor(test$Attrition, levels = c("Yes", "No"))
model2 <- naiveBayes(train[, c(2, 6, 7, 8, 13, 16, 18, 20, 24, 25, 26)], as.factor(train$Attrition), laplace = 1)
predictions <- predict(model2, test[, c(2, 6, 7, 8, 13, 16, 18, 20, 24, 25, 26)])
confMatrix <- table(predictions, as.factor(test$Attrition))
CM_NB_Select <- confusionMatrix(confMatrix)
masterAcc[j] <- CM_NB_Select$overall[1]
}
CM_NB_Select ## Confusion Matrix and Statistics
##
##
## predictions Yes No
## Yes 3 4
## No 22 145
##
## Accuracy : 0.8506
## 95% CI : (0.7888, 0.9)
## No Information Rate : 0.8563
## P-Value [Acc > NIR] : 0.6357048
##
## Kappa : 0.133
##
## Mcnemar's Test P-Value : 0.0008561
##
## Sensitivity : 0.12000
## Specificity : 0.97315
## Pos Pred Value : 0.42857
## Neg Pred Value : 0.86826
## Prevalence : 0.14368
## Detection Rate : 0.01724
## Detection Prevalence : 0.04023
## Balanced Accuracy : 0.54658
##
## 'Positive' Class : Yes
##
MeanAcc = colMeans(masterAcc)
MeanAcc ## [1] 0.8312069
#Use the predict function on the "validation" set (AttritionTest), with the same model.
Predictions<- predict(model2,AttritionTest[,c(2, 5, 6, 7, 12, 15, 17, 19, 23, 24, 25)])
AttritionTest$Attrition<- Predictions
#View(AttritionTest$Attrition)
AttritionPredictionsNB<- AttritionTest %>% select(c("ID","Attrition"))
write.csv(AttritionPredictionsNB,"CaseStudy2AttritionPredictionsNB_RenuKarthikeyan.csv", row.names = T)
Sensitivity is 12%, meaning only 12% of actual attrition cases are correctly identified; the model may not be very effective at capturing employees who are truly at risk of attrition. It does have a higher accuracy of 85.06%, with imbalanced performance: low sensitivity and high specificity. However, accuracy can be misleading, especially in imbalanced datasets where one class (e.g., "No attrition") dominates, so accuracy is not the best metric to evaluate this model. The positive predictive value (PPV) is 42.86%: when the model predicts attrition, there is a 42.86% chance the prediction is correct, reflecting the model's precision among all instances predicted positive. The prevalence of attrition (the proportion of actual positive cases in the dataset) is low at 14.37%, which contributes to the imbalanced nature of the performance metrics.
Best Model Determination from the Models tried for Attrition (KNN and Naive Bayes)
It looks like the KNN model with k = 7 and all predictors, without threshold adjustment, has the best accuracy among the KNN models. The Naive Bayes model with select predictors has the highest accuracy overall, and is better than the Naive Bayes model that included all predictors.
Naive Bayes has higher accuracy (85.06%) than KNN (79.89%); however, accuracy alone may not be the most informative metric, especially on imbalanced data. KNN has higher sensitivity (40.00%) than Naive Bayes (12.00%); sensitivity is crucial when identifying cases of attrition, as it represents the proportion of true positives among all actual positives. Naive Bayes has higher specificity (97.32%) than KNN (81.07%); specificity is important for minimizing false positives, but must be balanced against sensitivity. Naive Bayes also has a higher positive predictive value (precision) at 42.86%, versus 5.88% for KNN.
Of the two best models (the best KNN and the best Naive Bayes), I think Naive Bayes is the better model for predicting attrition, given its higher positive predictive value (precision), higher accuracy, and narrower confidence interval.
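As a supplementary check (a sketch using the confusion-matrix objects stored above; not part of the original analysis), the F1 score balances the precision/sensitivity trade-off discussed here:

```r
# F1 is the harmonic mean of precision and recall for the minority "Yes" class;
# caret stores it in the byClass slot of each confusionMatrix object.
CM_AllK7$byClass["F1"]       # best KNN (k = 7, all predictors)
CM_NB_Select$byClass["F1"]   # best Naive Bayes (select predictors)
```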
Thank You
This concludes the presentation and analysis. Thank you for your time; I look forward to empowering Frito Lay with data-driven insights. I created a Shiny app to visualize and explore the attrition data, so please feel free to look into the link provided. If you have any questions, feel free to reach out; my email is in the attached PowerPoint presentation. Thank you!